Tech Mesh London 2012

Presentation: "Beyond MapReduce"

Track: Next Generation Analytics / Time: Wednesday 15:20 - 16:10 / Location: Library

Apache Hadoop is the current darling of the "Big Data" world. At its core is the MapReduce computing model for decomposing large data-analysis jobs into smaller tasks and distributing those tasks around a cluster. MapReduce itself was pioneered at Google for indexing the Web and other computations over massive data sets.
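
To make the model concrete, here is a minimal, single-machine sketch of MapReduce-style word counting in plain Scala. The function names and the in-memory "shuffle" are illustrative assumptions, not Hadoop's actual API; in a real Hadoop job the map and reduce functions run as distributed tasks and the framework performs the shuffle between them.

// A minimal, single-machine sketch of the MapReduce model: word count.
object WordCountSketch {

  // Map phase: each input record (a line of text) becomes (key, value) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle: group intermediate pairs by key (done by the framework in Hadoop).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Reduce phase: combine all values for a key into a final result.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the lazy dog", "the fox")
    val intermediate = lines.flatMap(mapPhase)
    val results = shuffle(intermediate).map { case (w, cs) => reducePhase(w, cs) }
    results.toSeq.sortBy(-_._2).foreach { case (w, n) => println(s"$w\t$n") }
  }
}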

The strengths of MapReduce are cost-effective scalability and relative maturity. Its weaknesses are its batch orientation, which makes it unsuitable for real-time event processing, and the difficulty of expressing many data-analysis idioms in the MapReduce computing model.

We can address these weaknesses in several ways. In the near term, higher-level programming languages and tools, which provide common query and manipulation abstractions, make it easier to write MapReduce programs. Longer term, however, we need new distributed computing models that are more flexible for different classes of problems and that provide better real-time performance.
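
For a feel of what those higher-level abstractions buy, here is a hedged sketch of the same word count written as a short declarative pipeline. It uses ordinary Scala collections rather than any particular tool such as Hive, Pig, or Cascading; the point is only the level at which such tools let you express the computation, with no explicit map tasks, shuffle, or reduce tasks in sight.

// The same word count expressed as a short declarative pipeline.
object HighLevelWordCount {
  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the lazy dog", "the fox")

    val counts: Map[String, Int] =
      lines
        .flatMap(_.toLowerCase.split("\\W+"))        // tokenize
        .filter(_.nonEmpty)
        .groupBy(identity)                           // "GROUP BY word"
        .map { case (word, ws) => (word, ws.size) }  // "COUNT(*)"

    counts.toSeq.sortBy(-_._2).foreach { case (w, n) => println(s"$w\t$n") }
  }
}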

We'll review these strengths and weaknesses of MapReduce and the Hadoop implementation, then discuss several emerging alternatives, such as Google's Pregel system for graph processing and Storm for event processing. We'll finish with some speculation about the longer-term future of Big Data.
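
As a taste of the vertex-centric model that Pregel popularized, here is a toy, single-machine sketch in plain Scala. The graph, the update rule, and all names are hypothetical, chosen only to illustrate the "think like a vertex" superstep idea rather than Google's actual API: each vertex starts with its own id, sends its current value to its neighbors in every superstep, adopts the largest value it receives, and the computation halts when nothing changes.

// Toy vertex-centric computation: propagate the maximum id in each component.
object MaxValuePregelSketch {
  type VertexId = Int

  def main(args: Array[String]): Unit = {
    // Adjacency lists for a small undirected graph: 1-2, 2-3, 4-5.
    val edges: Map[VertexId, Seq[VertexId]] = Map(
      1 -> Seq(2), 2 -> Seq(1, 3), 3 -> Seq(2), 4 -> Seq(5), 5 -> Seq(4)
    )

    var values: Map[VertexId, Int] = edges.keys.map(v => v -> v).toMap
    var changed = true

    while (changed) {                        // one loop iteration = one superstep
      // Each vertex "sends" its current value to its neighbors...
      val inbox: Map[VertexId, Seq[Int]] =
        edges.toSeq
          .flatMap { case (src, dsts) => dsts.map(dst => (dst, values(src))) }
          .groupBy(_._1)
          .map { case (v, msgs) => (v, msgs.map(_._2)) }

      // ...and updates its own value from the messages it received.
      val updated = values.map { case (v, value) =>
        (v, (value +: inbox.getOrElse(v, Seq.empty)).max)
      }

      changed = updated != values
      values = updated
    }

    values.toSeq.sorted.foreach { case (v, max) => println(s"vertex $v -> $max") }
  }
}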


Dean Wampler, Big Dataist, O'Reilly Author


Biography: Dean Wampler

Dean Wampler is a Principal Consultant at Think Big Analytics, where he specializes in "Big Data" problems and tools such as Hadoop, and in machine learning. He also works with Scala, the JVM ecosystem, JavaScript, Ruby, functional and object-oriented programming, and Agile methods. Dean is a frequent speaker at industry and academic conferences on these topics. He has a Ph.D. in Physics from the University of Washington.